### Sources

- Merging dataframes: https://stackoverflow.com/questions/72925436/how-to-merge-multiple-6-dataframes-together-based-on-one-common-column-in-pyth
- Cross validation: https://scikit-learn.org/stable/modules/cross_validation.html
- Heteroskedasticity check: https://seaborn.pydata.org/generated/seaborn.residplot.html
- Two sample t-test: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
- VIF Testing: https://www.geeksforgeeks.org/detecting-multicollinearity-with-vif-python/

## Introduction

Our mental health is like the engine that keeps our everyday lives running smoothly. It's not just about feeling happy or sad – it's about how we think, how we cope with challenges, how we connect with others, and how we interact with the world around us. Whether at work, in our relationships, or when facing life's ups and downs, our mental health is a crucial player in every aspect of our daily functioning.

This fundamental role of mental health in our daily lives sparked a strong desire to dive deeper into it and understand it better. That curiosity became the driving force behind our exploration. Recognizing the importance of mental health in our day-to-day functioning, we embarked on this journey to uncover more about its complexities. As we navigate through this analysis, we aim to unravel the connections between socio-economic factors and mental health, seeking insights that can enhance our understanding and contribute to more effective solutions and policies for the well-being of individuals and communities alike.

Throughout the analysis, we examine factors such as the Human Development Index, Life Expectancy, GDP, Unemployment Rate, and Urbanization Rate that may influence the mental health profiles of diverse countries.

The dataset we examine, sourced from kaggle.com, provides a comprehensive perspective on mental health prevalence across multiple countries and years. By offering detailed insights into the prevalence rates of specific mental health disorders such as Schizophrenia, Bipolar disorder, Eating disorders, Anxiety disorders, Drug use disorders, Depression, and Alcohol use disorders, this dataset becomes an invaluable resource for revealing trends and patterns in mental health.
## Research Question

Can we use different socio-economic factors like the Human Development Index, Life Expectancy, GDP, Unemployment Rate, and Urbanization Rate to predict the Mental Health Rate in different countries in the future?
```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import duckdb
from functools import reduce
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import linregress
from statsmodels.api import OLS
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from scipy.stats import ttest_ind
from sklearn.preprocessing import StandardScaler
import itertools
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import mean_squared_error, mean_absolute_error, \
    mean_absolute_percentage_error, accuracy_score, precision_score, recall_score, \
    f1_score, precision_recall_curve
```

### Data Description:

- This is a pretty dense dataset that contains information about the prevalence of mental health disorders in multiple countries. Each row has information about a certain country or region for a certain year. The columns include Entity (the country or region name), Code (the code for the country or region), Year, as well as the percentage of people with specific disorders: 'Schizophrenia (%)', 'Bipolar disorder (%)', 'Eating disorders (%)', 'Anxiety disorders (%)', 'Drug use disorders (%)', 'Depression (%)', and 'Alcohol use disorders (%)'.
- The dataset was obtained from Kaggle.com and was compiled by a person who goes by the pseudonym "Amit". It was created for the purpose of understanding whether the prevalence rates of different types of mental illness increase or decrease over time. Overall, the dataset allows for a deeper understanding of different mental health conditions and their implications for people's lives.
- Source: https://www.kaggle.com/datasets/thedevastator/uncover-global-trends-in-mental-health-disorder

**Purpose** \
This dataset was originally created and funded by Our World in Data. Saloni Dattani analyzed the processes that might have influenced what data was observed and recorded by the researchers and found that they used surveys and screening questionnaires to create this data. During this process, healthcare professionals asked people questions about their symptoms, and these surveys were conducted anonymously in person, online, and over the phone.

**Data Collection** \
Subjectivity in Diagnosis: The diagnosis of mental illnesses relies on subjective symptoms and behaviors reported by individuals. This subjectivity can influence what data is observed, as individuals may interpret and express their symptoms differently.

Cultural and Legal Factors: Changes in the definitions of mental illnesses are influenced by cultural and legal factors. This may affect which symptoms are considered part of a mental illness, shaping the observed data in different countries.

Healthcare Disparities: Variations in healthcare systems, including accessibility, may lead to disparities in the observed data. Limited access to healthcare might result in underreported cases, particularly in regions with inadequate mental health support.

Lack of Public Awareness: Some individuals may not seek help due to a lack of awareness or discomfort in sharing symptoms. This influences the data by potentially underrepresenting the true prevalence of mental illnesses.

Differing Definitions Across Countries: Countries may use different definitions to diagnose patients, impacting the comparability of data between nations. Awareness of these differences is crucial for interpreting and utilizing the collected data effectively.

Treatment-Related Data Expectations: Individuals seeking mental health treatment may have varied expectations regarding the use of their data. Understanding these expectations is essential for ethical data use and maintaining public trust.

**Pre-processing** \
Hospital-Centric Data Collection: Data is primarily collected from hospitals, excluding clinic visits. This choice may lead to an incomplete representation of mental health cases, as people seeking care in clinics are not fully accounted for.

Exclusion of Private Healthcare Data: National data on mental health diagnoses often excludes information from private hospitals and clinics. This step may introduce bias by excluding a significant portion of mental health cases, especially for a dataset like this one that covers many countries.

For this dataset, we removed all NaN values and subsequently provided an enhanced, cleaned dataframe. The raw data had "Prevalence in Males" and "Prevalence in Females" columns with values for only a few countries, which was not uniform across the other datasets and would have introduced more bias, so we opted to exclude those rows. Additionally, some entries in the Year column were strings, so we converted them to numerical values. To make the dataset more uniform overall, we removed NaN values and constrained the years to the range between 2000 and 2015, aligning with the common timeframe across our datasets.

How this data came to be in the form that we are using: Initially, healthcare professionals and mental health specialists diagnose mental illnesses based on established criteria outlined in manuals like the International Classification of Diseases (ICD) and the Diagnostic and Statistical Manual of Mental Disorders (DSM). This diagnostic information is then collected from hospitals in numerous countries, encompassing details such as age, sex, reason for admission, diagnoses, and treatments. The survey data on mental health is gathered through structured interviews with a diverse range of participants, offering a broader representation of the population, including those who might not seek treatment. Subsequent data preprocessing steps involve excluding private healthcare data, potentially introducing biases, and dealing with limitations in data sources. The global estimation methods employ statistical approaches, adjusting for demographic variations and differences in data sources and collection periods.
### Data Cleaning

The first step we took was to create a function that takes a dataframe, checks whether it has any missing or NaN values, drops those NaN values, and then returns a cleaned-up dataframe. We reused this function throughout the data cleaning step for all our datasets.
```python
# Create a function that checks for any NaN values and then drops them.
def checkNans(df):
    nan_vals_exist = df.isna().any().any()
    if nan_vals_exist:
        print("There are NaN values in the dataset.")
        df = df.dropna()
        return checkNans(df)
    else:
        print("No NaN values in the dataset.")
        return df
```

We started with our first dataset, which we named global_mental_health_df.
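The function's behavior can be sanity-checked on a small hypothetical frame (toy data, not part of our datasets): it should drop the row containing the NaN and leave the rest untouched.

```python
import pandas as pd
import numpy as np

# Re-stating the cleaning helper so this snippet runs standalone.
def checkNans(df):
    nan_vals_exist = df.isna().any().any()
    if nan_vals_exist:
        print("There are NaN values in the dataset.")
        df = df.dropna()
        return checkNans(df)
    else:
        print("No NaN values in the dataset.")
        return df

# Hypothetical toy frame: one row has a missing value.
toy = pd.DataFrame({"Country": ["A", "B", "C"],
                    "Depression (%)": [3.1, np.nan, 4.2]})
clean = checkNans(toy)  # prints the "NaN values" message, then "No NaN values"
print(clean.shape)      # (2, 2) - the row with the NaN was dropped
```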
```python
# Read Global Mental Health Dataset
global_mental_health_df = pd.read_csv('Mental health Depression disorder Data 2.csv')
print(global_mental_health_df.head())
```

The original dataset introduces new columns such as "Prevalence in Males" and "Prevalence in Females" starting from row 6468. These additional columns only included values for very few countries, which would have given us an uneven dataset, so we decided to drop the rows after the last instance of the last country. Additionally, some of the values in the Year column were strings that had to be converted to numbers. We also dropped some irrelevant columns, like index and code. The original dataset contained NaNs and years going back to 9000 BCE for some regions, so we dropped the NaN values and limited the dataset to the years 2000 through 2015, the range of years common across our datasets.
```python
# Drop all rows after Prevalence in males & females
global_mental_health_df = global_mental_health_df.iloc[:6468]

# Convert Year to numeric and limit dataset to the years 2000-2015
global_mental_health_df['Year'] = pd.to_numeric(global_mental_health_df['Year'])
global_mental_health_df = global_mental_health_df[(global_mental_health_df['Year'] >= 2000) &
                                                  (global_mental_health_df['Year'] <= 2015)]

# Drop index and code columns
global_mental_health_df = global_mental_health_df.drop(labels=['index', 'Code'], axis=1)

# Rename Entity column to Country
global_mental_health_df = global_mental_health_df.rename(columns={'Entity': 'Country'})

# Drop rows with NaNs
global_mental_health_df = checkNans(global_mental_health_df)
print(global_mental_health_df.head())
print(global_mental_health_df.shape)
```
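The string-to-number conversion described above can be illustrated on a small hypothetical Year column: `pd.to_numeric` parses numeric strings into numbers so that comparisons like `Year >= 2000` behave as expected.

```python
import pandas as pd

# Hypothetical mix of string and integer years, mimicking the raw file.
df = pd.DataFrame({"Year": ["1999", "2004", 2010, "2016"]})
df["Year"] = pd.to_numeric(df["Year"])

# Numeric comparisons now work on every row.
in_range = df[(df["Year"] >= 2000) & (df["Year"] <= 2015)]
print(in_range["Year"].tolist())  # [2004, 2010]
```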
### Data Description:

- The dataset contains information about different factors related to life expectancy. The health factors for 193 countries were collected from the WHO data repository website, and the corresponding economic data was collected from the United Nations website.
- This information was then filtered and compiled by Kumar Rajarshi to include critical factors that are more representative. He created the Life Expectancy dataset for the purpose of accounting for immunization and human development rates in relation to life expectancy, considering demographic variables, income composition, and mortality rates, by formulating a regression model with data from 2000 to 2015 for all the countries.
- The dataset includes columns like 'Country', 'Year', 'Status', and 'Life expectancy ', as well as many additional health status variables like 'Polio', 'under-five deaths ', and ' HIV/AIDS'.
- Source: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who

Observations (Rows): The dataset consists of 2938 observations, each representing a specific country over the period from 2000 to 2015.

Attributes (Columns): There are 22 attributes (columns) in the dataset, with each column representing a different variable. These variables include factors related to immunization, mortality, economics, and social aspects.

**Purpose** \
The dataset was created to analyze factors affecting life expectancy, focusing on critical health-related factors such as immunization, mortality, economics, and social factors. The aim is to understand the impact of these variables on life expectancy for 193 countries from 2000 to 2015. Who funded the creation of the dataset? Since this dataset was originally published by the WHO, it is possible that the WHO funded the research; however, the funding source is not explicitly mentioned in the provided information.

**Data Collection** \
The data was collected from the Global Health Observatory (GHO) data repository under the World Health Organization (WHO) and the United Nations website. The selection of factors considered critical was based on their representativeness in influencing life expectancy. The decision to focus on the years 2000 to 2015 reflects a period of significant health sector development.

**Pre-processing** \
Individual data files from the WHO and the United Nations were merged into a single dataset.

Handling Missing Values: Missing data, primarily for population, Hepatitis B, and GDP, were identified. The Missmap command in R was used to visualize and handle missing data, and some less-known countries with substantial missing data were excluded. We further confirmed there were no NaN values by applying our own checkNans function and dropping rows with NaNs.

Predicting variables were categorized into immunization-related, mortality-related, economic, and social factors.

Awareness of Data Collection: Acknowledgements mention the assistance of Deeksha Russell and Duan Wang in collecting data from the WHO and United Nations websites, indicating that people were involved and aware of the data collection process.

For our GDP dataset, we also removed NaN values, and to maintain consistency across datasets, we changed the label "Country/Area" to simply "Country" and filtered the dataset to the years 2000 through 2015. We also excluded redundant columns, such as "Unit," and narrowed the dataset down to retain only the "Country/Area," "Year," and "GDP" columns for the same reason.
```python
# Load the life expectancy data
life_df = pd.read_csv('1Life Expectancy Data, GPD, Popn.csv')
print(life_df.head())
```

### Data Cleaning:

Since we were only interested in the life expectancy data from this dataset, we subsetted the columns to country, year, and life expectancy. Afterwards, we applied our checkNans function to check for any NaN values and removed the corresponding rows from the dataset.
```python
# Data Cleaning
columns_to_keep = ['Country', 'Year', 'Life expectancy ']

# Subset the DataFrame to include only country, year, and life expectancy
life_df = life_df[columns_to_keep]

# Display the subset DataFrame
print(life_df.head())
print(life_df.shape)
```

```python
# Check for NaN values (reassign, since checkNans returns a new dataframe)
life_df = checkNans(life_df)
print(life_df.head())
print(life_df.shape)
```
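One detail worth noting, illustrated here with hypothetical toy data: pandas' `dropna` returns a new dataframe rather than modifying the original in place, so the result of a cleaning helper like our `checkNans` must be assigned back to the variable.

```python
import pandas as pd
import numpy as np

# Hypothetical single-column frame with one missing value.
df = pd.DataFrame({"Life expectancy ": [71.0, np.nan, 65.3]})

df.dropna()           # returns a cleaned copy, which is discarded here
print(df.shape)       # (3, 1) - the original still contains the NaN row

df = df.dropna()      # assigning the result actually updates the frame
print(df.shape)       # (2, 1)
```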
### Data Description:

- The dataset contains information about the Gross Domestic Product (GDP) of countries from 1970 to 2021. The columns include Country, Year, Unit, and GDP in US Dollars.
- This dataset was collected from the National Accounts Main Aggregates Database, which presents a series of analytical national accounts tables from 1970 onwards for more than 200 countries and areas of the world. It is the product of a global cooperation effort between the Economic Statistics Branch of the United Nations Statistics Division, international statistical agencies, and the national statistical services of these countries. The database is updated in December of each year with newly available national accounts data for all countries and areas.
- Source: https://unstats.un.org/unsd/snaama/Basic

**Purpose** \
The dataset was created to provide users with crucial economic indicators.

**Data Collection** \
Processes influencing data observation and recording: The dataset, comprising GDP estimates, is subject to revisions based on improvements in national accounts series, new data sources, better estimation methods, classifications, standards, and the process of "re-basing" volume estimates. The frequency of revisions is influenced by countries updating their national accounts methodology and incorporating new source statistics.

**Pre-processing** \
In this case, revisions occur due to incorporating new data sources and improving estimation methods, classifications, and standards. Re-basing volume estimates contributes to large changes in previously published values. Additionally, the World Bank employs quality assurance procedures to review major changes for accuracy before updating the World Development Indicators database. The freshness of the base year of national accounts is another factor considered a key indicator of statistical capacity.

For this dataset, we subsetted the columns to country, year, and GDP, since GDP is the only measure we needed from it. Just like with the Global Mental Health Disorders dataset, we also removed NaN values.
```python
# Import GDP data
gdp_df = pd.read_csv('gdp.csv')
print(gdp_df.head())
```

### Data Cleaning:

We removed unnecessary columns such as "Unit" and subsetted the dataset so that only the Country/Area, Year, and GDP columns were kept. We renamed Country/Area to Country for consistency across datasets. Furthermore, we filtered the dataset to contain only the years between 2000 and 2015. Afterwards, we applied our checkNans function to check for any NaN values and removed the corresponding rows from the dataset.
```python
## Data Cleaning
# Keep only the country, year, and GDP columns
columns_to_keep = ['Country/Area', 'Year', 'GDP, at current prices - US Dollars']
gdp_df = gdp_df[columns_to_keep]
gdp_df = gdp_df[(gdp_df['Year'] >= 2000) & (gdp_df['Year'] <= 2015)]

# Rename Country/Area column to Country
gdp_df = gdp_df.rename(columns={'Country/Area': 'Country'})

# Display the subset DataFrame
print(gdp_df.head())
print(gdp_df.shape)
```

```python
# Check for NaN values
gdp_df = checkNans(gdp_df)
print(gdp_df.head())
print(gdp_df.shape)
```

### Data Description:
- The Human Development Index (HDI) dataset was obtained from Human Development Reports by the United Nations Development Programme and was created for the purpose of providing high-quality international statistics that are free and accessible for all. It contains information on the Human Development Index for multiple countries from 1990-2021. The HDI is calculated by taking into account the Life Expectancy, Education, and Per-Capita Income of countries.
- Source: https://hdr.undp.org/data-center/documentation-and-downloads
What are the observations (rows) and the attributes (columns)?
Observations (rows): Each row contains the HDI code that ranks the country as low, medium, or high HDI, the region of the country, and HDI scores from 1990 to 2021.
Attributes (columns): The columns consist of every country that was studied, the HDI code, the region, and the HDI rates from 1990 to 2021.
**Purpose**
The dataset was created to provide information on the Human Development Index (HDI) for various countries over the years. The HDI is a summary measure of average achievement in key dimensions of human development, including life expectancy, education, and standard of living.
**Data Collection**
The dataset was obtained from Human Development Reports by the United Nations Development Programme (UNDP). The UNDP is a UN agency that works towards sustainable development and poverty reduction.
The data observed and recorded are likely influenced by the availability of reliable and consistent information from different countries. The dataset focuses on key dimensions of human development, and the indicators included in the HDI are widely accepted as important factors. However, data availability, reporting practices, and political considerations in different countries may have influenced what data was included.
The information provided doesn't explicitly mention whether people were aware of the data collection. However, given that the data is sourced from international data agencies with the mandate to collect national data on specific indicators, it is reasonable to assume that relevant authorities in each country were aware of the data collection. The purpose of collecting such data is likely for assessing and monitoring human development, informing policy decisions, and facilitating international comparisons.
**Pre-processing**
The preprocessing steps include dropping unnecessary columns, renaming columns, subsetting the dataset to specific years (2000-2015), handling missing values, and converting the dataset from wide to long format using the 'melt' function. The data came from the original Human Development Reports, and preprocessing was done to make it suitable for analysis, including handling missing values and organizing it into a structured format.
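The wide-to-long conversion with `melt` described above turns one column per year into one row per (country, year) pair. On toy data (the values here are illustrative, not from the real dataset):

```python
import pandas as pd

# Toy wide-format frame: one column per year, as in the raw HDI file
wide = pd.DataFrame({'Country': ['A', 'B'],
                     '2000': [0.60, 0.71],
                     '2001': [0.61, 0.72]})

# Melt to long format: one row per (Country, Year) pair
long = pd.melt(wide, id_vars=['Country'], value_vars=['2000', '2001'],
               var_name='Year', value_name='HDI')
print(long.shape)  # (4, 3): 2 countries x 2 years, columns Country/Year/HDI
```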
```python
# Import HDI data
hdi_df = pd.read_csv('1HDI rates.csv')
print(hdi_df.head())
```

### Data Cleaning:
We started by manually deleting multiple columns in Excel that included metrics used in calculating the HDI. We then dropped the hdi_rank_2021 column and renamed all the column names to get just the years. We then subsetted the dataset so only the Country column and the year columns were kept. We capitalized the country column for consistency across datasets. Furthermore, we filtered the dataset to contain only the years between 2000 and 2015. Afterwards, we applied our checkNans function to check for any NaN values and removed the corresponding rows from the dataset.
```python
# Renaming column names
hdi_df = hdi_df.drop('hdi_rank_2021', axis=1)
for column in hdi_df.columns:
    if column.startswith("hdi_"):
        new_column_name = column[4:]
        hdi_df.rename(columns={column: new_column_name}, inplace=True)
print(hdi_df.head())
```

```python
# List of years from 2000 to 2015
years_subset = [str(year) for year in range(2000, 2016)]

# Subset the DataFrame to include both the country name and the required columns
hdi_df_subset = hdi_df[['country'] + years_subset]

# Rename country column to Country
hdi_df_subset = hdi_df_subset.rename(columns={'country': 'Country'})

# Display the subset DataFrame
print(hdi_df_subset.head())
```

```python
# Check for NaN values
checkNans(hdi_df_subset)
print(hdi_df_subset.head())
```

The resulting dataframe is wide, so we converted it to a long dataframe by melting it.
```python
# Melting Dataframe
hdi_df_subset = pd.melt(hdi_df_subset, id_vars=['Country'],
                        value_vars=['2000', '2001', '2002', '2003', '2004', '2005', '2006',
                                    '2007', '2008', '2009', '2010', '2011', '2012', '2013',
                                    '2014', '2015'],
                        var_name='Year', value_name='HDI')
print(hdi_df_subset.head())
```
### Data Description
- This is a pretty dense dataset that contains information about urbanization rates in multiple countries. Each row has information about a certain country or region for a certain year. The columns include Country Name, Country Code (the code for the country or region), Indicator Name (the code of the indicator used in World Bank data), Indicator Code, and Year.
- The dataset was obtained from Kaggle.com and was compiled by a user named "HANNA YUKHYMENKO". It was created for the purpose of understanding the occurrence of urban populations across different countries over time. Overall, the dataset allows for a deeper understanding of urbanization rates in a country and their implications when it comes to impacting lives.
- Link to Dataset: https://www.kaggle.com/datasets/equinxx/urban-population-19602021-by-country

Observations (Rows): Each row contains a country's urbanization rate for a particular year from 2000 to 2015.
Attributes (Columns): The columns contain the name of the country, the year for which the urbanization rate is recorded, and the percentage of urban population relative to the total population.
**Purpose**
The dataset was created to explore and analyze the urbanization rates across different countries from 1960 to 2021. The inspiration, as mentioned in the information, is to understand how the percentage of urban population has developed in different continents, identify the "most urban" country, and examine countries with negative tendencies in urban population levels.
**Data Collection**
The dataset was provided by the World Bank. The World Bank is a major international financial institution that provides financial and technical assistance to developing countries for development projects.
The dataset is likely influenced by the availability of data from different countries. Not all countries may have comprehensive or consistent data for every year, and this could influence what data is observed and recorded. Additionally, the focus on urban population data suggests an interest in understanding trends related to urbanization.
Although it’s not explicitly stated whether people were aware of the data collection, given that the dataset was created for analysis and exploration, it is reasonable to assume that data collection was conducted with a purpose, and stakeholders might have expected it to be used for gaining insights into urbanization trends and patterns. The dataset is also shared on Kaggle, suggesting an openness to public use and analysis.
**Pre-processing**
Preprocessing steps include handling NaN values, converting string instances in the 'Year' column to numerical values, dropping irrelevant columns (Country Code, Indicator Name, Indicator Code), limiting the dataset to the years 2000-2015, and melting the dataset to reshape it. The 'Urbanization Rate' was the primary focus, and unnecessary information was removed to make the dataset more manageable and suitable for analysis.
We added another dataset, which we named urban_df.
```python
# Import Urbanization Rate data
urban_df = pd.read_csv('1urban_percent.csv')
print(urban_df.head())
```

Some of the values in the Year column were string instances that had to be converted to numbers. We also decided to drop some irrelevant columns: Country Code, Indicator Name, and Indicator Code. The original dataset contained NaNs, so we removed the rows that contained these NaN values. We renamed Country Name to "Country" to match the other final datasets. We limited the dataset to the years between 2000 and 2015, as that was the range of years in common across our datasets. Lastly, we melted our dataset so that Country, Year, and Urbanization Rate would appear as columns.
```python
# Data Cleaning
# List of years from 2000 to 2015
years_subset = [str(year) for year in range(2000, 2016)]

# Subset the DataFrame to include both the country name and the required columns
urban_df_subset = urban_df[['Country Name'] + years_subset]

# Rename Country Name column to Country
urban_df_subset = urban_df_subset.rename(columns={'Country Name': 'Country'})

# Display the subset DataFrame
print(urban_df_subset.head())
```

```python
# Check for NaN values
checkNans(urban_df_subset)
print(urban_df_subset.head())
```

```python
# Melting Dataframe
urban_df_subset = pd.melt(urban_df_subset, id_vars=['Country'],
                          value_vars=['2000', '2001', '2002', '2003', '2004', '2005', '2006',
                                      '2007', '2008', '2009', '2010', '2011', '2012', '2013',
                                      '2014', '2015'],
                          var_name='Year', value_name='Urbanization Rate')
print(urban_df_subset.head())
```

### Data Description:
- This is a pretty dense dataset that contains information about unemployment rates in multiple countries. Each row has information about a certain country or region for a certain year. The columns include Country Name, Country Code (the code for the country or region), and Year.
- The dataset was obtained from Kaggle.com and was compiled by a user named "Anjali pant". It was created for the purpose of understanding the occurrence of unemployment rates across different countries over time. Overall, the dataset allows for a deeper understanding of the health of a country's economy and its implications when it comes to impacting lives.
- Link to Dataset: https://www.kaggle.com/datasets/pantanjali/unemployment-dataset

Observations (Rows): Each row represents a specific country with data for the unemployment rate for the years 2000 to 2015.
Attributes (Columns): The columns contain the name of the country, the year for which the unemployment rate is recorded, and the percentage of unemployment in the corresponding year and country.
**Purpose**
This dataset was created to provide information on the unemployment rate in various countries over the past 31 years (1991-2021). It aims to contribute to the understanding of economic conditions and trends related to unemployment. The introduction in the dataset highlights the importance of unemployment as an indicator of the health of the economy.
**Data Collection**
The information provided doesn't explicitly mention who funded the creation of the dataset. However, the data was sourced from the World Bank, a reputable global development organization. The World Bank typically collects and provides economic data, and its funding comes from member countries.
The data collection process is likely influenced by the World Bank's focus on economic indicators. Factors influencing data collection may include the availability of data from different countries, the reliability of the sources, and the importance of unemployment as a key economic indicator. The dataset seems to cover a broad range of countries and years, but certain countries or years might be missing due to data availability.
Although it’s not specified whether individuals were aware of the data collection, given that the data is sourced from the World Bank, it's reasonable to assume that the countries contributing the data were aware, as international organizations typically collaborate with member countries for data collection. The primary purpose of collecting unemployment data is likely for economic analysis, policy-making, and monitoring global economic trends.
**Pre-processing**
The dataset passed through several preprocessing steps to enhance its utility for analysis. Initially, the "Year" column, containing string instances, was converted into numerical values. To focus the analysis on a specific time frame, the dataset was limited to the years between 2000 and 2015. Recognizing the irrelevance of the "Country Code" column for the intended analysis, it was subsequently removed. Additionally, for consistency, the "Country Name" column was renamed to simply "Country." A crucial aspect of the preprocessing involved checking and confirming the absence of any missing values (NaN) in the dataset, ensuring data integrity. Finally, the dataset was melted, transforming its structure to feature distinct columns for Country, Year, and Unemployment Rate. These preprocessing measures collectively aimed at refining the dataset, making it more amenable for subsequent analytical endeavors.
We added another dataset, which we named unemploy_df.
```python
# Import Unemployment Rate data
unemploy_df = pd.read_csv('1unemployment analysis.csv')
print(unemploy_df.head())
```

Some of the values in the Year column were string instances that had to be converted to numbers. We limited the dataset to the years between 2000 and 2015, as that was the range of years in common across our datasets. The original dataset contains a Country Code column, which we removed since it is irrelevant for our analysis. We renamed Country Name to "Country" to match the other final datasets. We checked for NaN values but didn't have to drop any rows, since no NaN values occurred. Lastly, we melted our dataset so that Country, Year, and Unemployment Rate would appear as columns.
```python
# Data Cleaning
# List of years from 2000 to 2015
years_subset = [str(year) for year in range(2000, 2016)]

# Subset the DataFrame to include both the country name and the required columns
unemploy_df_subset = unemploy_df[['Country Name'] + years_subset]

# Rename Country Name column to Country
unemploy_df_subset = unemploy_df_subset.rename(columns={'Country Name': 'Country'})

# Display the subset DataFrame
print(unemploy_df_subset.head())
```

```python
# Check for NaN values
checkNans(unemploy_df_subset)
print(unemploy_df_subset.head())
```

```python
# Melting Dataframe
unemploy_df_subset = pd.melt(unemploy_df_subset, id_vars=['Country'],
                             value_vars=['2000', '2001', '2002', '2003', '2004', '2005', '2006',
                                         '2007', '2008', '2009', '2010', '2011', '2012', '2013',
                                         '2014', '2015'],
                             var_name='Year', value_name='Unemployment Rate')
print(unemploy_df_subset.head())
```

## Merging Datasets
Now that we have cleaned all our datasets, our next step is merging them. We first created a function called changeYear that converts a dataframe's Year column to a date type and then extracts the year, to make the type of the year consistent across all dataframes before merging. We then merged the datasets on the common countries and years across all dataframes. As we did with all other dataframes, we called our checkNans function to remove all NaN values.
```python
def changeYear(df):
    # Convert Year to datetime, then extract the integer year
    df['Year'] = pd.to_datetime(df['Year'], format='%Y')
    df['Year'] = df['Year'].dt.year

changeYear(global_mental_health_df)
changeYear(life_df)
changeYear(gdp_df)
changeYear(hdi_df_subset)
changeYear(urban_df_subset)
changeYear(unemploy_df_subset)

dfs = [global_mental_health_df, life_df, gdp_df, hdi_df_subset, urban_df_subset, unemploy_df_subset]
merged_df = reduce(lambda left, right: pd.merge(left, right, on=['Country', 'Year'], how='inner'), dfs)
print(merged_df.head())
print(merged_df.shape)
```

```python
# Check for NaN values in the merged dataset
merged_df = checkNans(merged_df)
merged_df.shape
```

```python
# Check for unique countries
unique_countries = merged_df['Country'].unique()
print(unique_countries)
num_unique_countries = merged_df['Country'].nunique()
print(num_unique_countries)
```

The final step we took was to limit the years from 2000 to 2015, because that was the range of years all our dataframes had in common.
```python
# Filter out countries that don't have entries for every year from 2000 to 2015
final_df = duckdb.sql('SELECT * \
                       FROM merged_df \
                       WHERE Country IN ( \
                           SELECT Country \
                           FROM merged_df \
                           WHERE Year BETWEEN 2000 AND 2015 \
                           GROUP BY Country \
                           HAVING COUNT(DISTINCT Year) = 16)').df()
print(final_df.shape)
final_df.head()
```

```python
# Check for the number of unique countries
num_unique_countries = final_df['Country'].nunique()
print(num_unique_countries)
```

```python
# Check for NaN values in the final dataset
final_df = checkNans(final_df)
final_df.shape
```

```python
# Change the datatype of all columns except Country and Year to numerical values
columns_to_convert = ['Schizophrenia (%)', 'Bipolar disorder (%)', 'Eating disorders (%)',
                      'Anxiety disorders (%)', 'Drug use disorders (%)', 'Depression (%)',
                      'Alcohol use disorders (%)', 'Life expectancy ',
                      'GDP, at current prices - US Dollars', 'HDI',
                      'Urbanization Rate', 'Unemployment Rate']
final_df[columns_to_convert] = final_df[columns_to_convert].apply(pd.to_numeric, errors='coerce')

# Check if they are converted
print(final_df.dtypes)

# Check for NaNs one last time and print the shape
final_df = checkNans(final_df)
print(final_df.shape)
print(final_df.head())
```

```python
mental_health_disorders = ['Schizophrenia (%)', 'Bipolar disorder (%)', 'Eating disorders (%)',
                           'Anxiety disorders (%)', 'Drug use disorders (%)', 'Depression (%)',
                           'Alcohol use disorders (%)']

# Group the data by 'Country'
grouped = final_df.groupby('Country')

# Calculate slopes for each mental health disorder for each country
slopes = {}
for country, group in grouped:
    country_slopes = {}
    years = group['Year']
    for disorder in mental_health_disorders:
        slope, _, _, _, _ = linregress(years, group[disorder])
        country_slopes[disorder] = slope
    slopes[country] = country_slopes

# Find the country and disorder with the highest slope
max_slope = -float('inf')
max_country = None
max_disorder = None
for country, country_slopes in slopes.items():
    for disorder, slope in country_slopes.items():
        if slope > max_slope:
            max_slope = slope
            max_country = country
            max_disorder = disorder

print(f"Country with the highest slope: {max_country}")
print(f"Mental health disorder with the highest slope: {max_disorder}")
print(f"Highest slope value: {max_slope}")
```

The output indicates that among the countries in our dataset, Mongolia has the highest slope for alcohol use disorder rates. This suggests that, based on the trends observed in the data, Mongolia has shown the highest rate of change (increase or decrease) in the prevalence of alcohol use disorders over the years captured in our dataset compared to other countries. To check whether this change was positive or negative, we can create a line plot.
```python
# Subset data for Mongolia and Alcohol use disorders (%)
mongolia_data = final_df[final_df['Country'] == 'Mongolia']
alcohol_data = mongolia_data[['Year', 'Alcohol use disorders (%)']]
years = alcohol_data['Year']
alcohol_disorder = alcohol_data['Alcohol use disorders (%)']

# Plot trend for Mongolia
plt.figure(figsize=(8, 6))
plt.plot(years, alcohol_disorder, marker='o', linestyle='-')
plt.title('Trend of Alcohol use disorders in Mongolia')
plt.xlabel('Year')
plt.ylabel('Alcohol use disorders (%)')

# Adding regression line
slope, intercept, _, _, _ = linregress(years, alcohol_disorder)
plt.plot(years, slope * years + intercept, color='r', label='Regression Line')
plt.legend()
plt.show()
```

The line plot shows how alcohol use disorder rates have been increasing in Mongolia through the years. We can also graph the rates of the other disorders in Mongolia to further support our observation.
```python
# Set labels for the legend
disorder_labels = ['Schizophrenia', 'Bipolar disorder', 'Eating disorders', 'Anxiety disorders',
                   'Drug use disorders', 'Depression', 'Alcohol use disorders']

# Plot each mental health disorder
plt.figure(figsize=(10, 6))
for i, disorder in enumerate(mental_health_disorders):
    plt.plot(mongolia_data['Year'], mongolia_data[disorder], label=disorder_labels[i])

# Set labels and title
plt.xlabel('Year')
plt.ylabel('Percentage')
plt.title('Trends of Mental Health Disorders in Mongolia')
plt.legend()
plt.show()
```

The fact that most of the disorders appear as straight lines while one stands out with an increasing trend could indicate that the data for those disorders is relatively stable across the years, while alcohol use disorders show a noticeable increase over time in Mongolia.
### Boxplot

We visualized every mental health disorder for the year 2009 using a boxplot, which displays the distribution of each disorder across all countries.
```python
# Select mental health disorder columns
mental_health_columns = ['Schizophrenia (%)', 'Bipolar disorder (%)', 'Eating disorders (%)',
                         'Anxiety disorders (%)', 'Drug use disorders (%)', 'Depression (%)',
                         'Alcohol use disorders (%)']

# Filter data for the year 2009
final_df_2009 = final_df[final_df["Year"] == 2009]

# Create a boxplot for each mental health disorder
plt.figure(figsize=(10, 6))
final_df_2009[mental_health_columns].boxplot()
plt.title('Boxplot of Mental Health Disorders in 2009')
plt.ylabel('Percentage')
plt.xlabel('Mental Health Disorders')
plt.xticks(rotation=45)  # Rotate x-axis labels for better visibility
plt.show()
```

From this boxplot, we can see that anxiety disorders have the highest average percentage of all disorders, with depression rates averaging second highest. Thus, we plan on testing how we can predict depression and anxiety rates in our hypotheses.
We now plot scatterplots to see which independent variable out of Life expectancy, GDP, HDI, Urbanization Rate, and Unemployment Rate might have had the largest impact on the increasing alcohol use disorder rate.
```python
variables_of_interest = ['Life expectancy ', 'GDP, at current prices - US Dollars', 'HDI',
                         'Urbanization Rate', 'Unemployment Rate']
fig, axes = plt.subplots(nrows=len(variables_of_interest), ncols=1,
                         figsize=(8, 6 * len(variables_of_interest)))

# Iterate through each variable and create individual scatter plots
for i, variable in enumerate(variables_of_interest):
    sns.scatterplot(x=variable, y='Alcohol use disorders (%)', data=mongolia_data, ax=axes[i])
    axes[i].set_xlabel(variable)
    axes[i].set_ylabel('Alcohol use disorders (%)')
    axes[i].set_title(f'Relationship between {variable} and Alcohol use disorders (%)')
plt.tight_layout()
plt.show()
```

### Heatmap

The correlation heatmap considers a specific year and compares all countries in our dataset across all variables considered for our project. From this heatmap, we can observe correlations among multiple variables, including variables that behave in a similar fashion. For instance, alcohol use and unemployment rate are negatively correlated with every mental health disorder, and depression is negatively correlated with every variable. Alternatively, certain clusters indicate similar trends, such as bipolar disorder, eating disorders, and anxiety disorders all being positively correlated with one another. Furthermore, we can get a general idea of which variables have low correlation with mental health disorders. For instance, life expectancy and HDI are minimally correlated with drug use, which indicates those variables may not be effective in modeling changes in drug use.
```python
# Function to create a heatmap for a selected year
def makeHeatmap(df, year):
    heatmap_data = df.iloc[:, 1:]
    heatmap_data = heatmap_data[heatmap_data['Year'] == year]
    heatmap_data = heatmap_data.drop('Year', axis=1)
    correlation_matrix = heatmap_data.corr()
    plt.figure(figsize=(6, 4))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
    plt.title(f'Correlation Heatmap for Mental Disorders in {year} Across Countries')
    plt.show()

makeHeatmap(final_df, 2015)
makeHeatmap(final_df, 2008)
```

## Data Limitations

**Data bias:** While cleaning the data and dropping NaNs, we realized there is unequal representation: some countries appear more frequently than others, leading to a skewed distribution of data. This overrepresentation limits the generalizability and validity of our findings and potentially leads to overlooking less-represented nations. Additionally, this affects the rankings and comparisons in our results because our data represents only 135 of the 195 countries in the world, excluding lesser-known countries like Vanuatu, Tonga, Togo, Cabo Verde, etc.

**Repetitiveness:** Our life expectancy dataset incorporates both GDP and population data for each year, which can significantly increase the size of the dataset and pose challenges for data management and analysis. It can also lead to redundancy: because identical GDP and population values may be replicated for each corresponding life expectancy record, this creates inefficiencies in data storage and analysis.

**Mental health disorders could overlap:** The global mental health disorders dataset contains categories that may overlap, since mental health disorders are not mutually exclusive. For instance, people with anxiety can also have depression.
Mental health disorders do not necessarily exist in isolation, and individuals can suffer from multiple disorders simultaneously. This co-occurrence of mental health conditions presents several constraints when conducting analysis and interpreting results. The original data, obtained from the global trends in mental health disorders dataset, consists of columns detailing the Entity, Country Code, Year, and various mental health conditions, including Schizophrenia, Bipolar Disorder, Eating Disorders, Anxiety Disorders, Drug Use Disorders, Depression, and Alcohol Use Disorders. However, this dataset does not account for overlap among these disorders. As a result, distinguishing the individual influence of each disorder becomes challenging, and it may be difficult to identify whether observed effects come from a single disorder or from interactions among multiple disorders.

**Different countries have different metrics for identifying diseases:** Different countries may use different standards and measures in the identification and diagnosis of diseases. These disparities can impact our analytical procedures and the way we interpret results, complicating cross-country comparisons and giving rise to discrepancies in reported disease prevalence. Additionally, cultural and societal norms can shape how individuals perceive and report symptoms, influencing their inclination to seek medical attention and affecting self-reported data. For example, according to the National Institutes of Health, in the United States an individual must have five depression symptoms every day, nearly all day, for at least two weeks, while according to the National Library of Medicine, depression in Kenya is diagnosed alongside tobacco use and engagement in binge drinking.
**Data Availability:** In our initial raw data, not all years align across datasets. As a result, we found it essential to narrow the scope of our analysis. For instance, the unemployment dataset covers the years 1991 to 2021, while other datasets, such as the life expectancy dataset, cover 2000 to 2015. This led us to concentrate on years with complete and dependable data.

The HDI dataset comes from the Human Development Reports by the United Nations Development Programme and was created to provide high-quality international statistics that are free and accessible to all. We have no reason to question the trustworthiness of this dataset because it comes from an international organization with no personal interests to fulfill if the results were biased.

Hanna Yukhymenko created the Urbanization dataset to answer the question of how the percentage of urban population has developed across continents, as well as to find the most urban country by analyzing both percentage and total urban population. She obtained this information from The World Bank. Even though the information comes from a trustworthy organization, Hanna Yukhymenko is from Switzerland and may have introduced bias when creating this dataset, especially because it compares countries.

Anjali Pant, an engineering student from India, created the unemployment dataset to analyze why unemployment occurs. Anjali's country, India, has a relatively stable unemployment rate across most years, leading us to believe this dataset could be trustworthy.
Additionally, Anjali obtained this data from The World Bank, which is also an internationally recognized organization.

Kumarra Jarshi created the life expectancy dataset to account for immunization and human development rates in relation to life expectancy, considering demographic variables, income composition, and mortality rates, by formulating a regression model based on a mixed effects model and multiple linear regression using data from 2000 to 2015 for all countries. We have no reason to question the trustworthiness of this dataset because it cites the World Health Organization (WHO), which uses standardized metrics to collect data, as its source.
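One way to quantify the representation skew described under "Data bias" is to count how many rows each country contributes after cleaning. The sketch below is illustrative only: it uses a hypothetical toy frame in place of our actual `final_df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the cleaned final_df
df = pd.DataFrame({"Country": ["Mongolia"] * 16 + ["Kenya"] * 16 + ["Tonga"] * 4})

# Rows contributed per country; a large min/max gap flags unequal representation
counts = df["Country"].value_counts()
print(counts.min(), counts.max())  # -> 4 16
```

A large spread between the smallest and largest counts would confirm that some countries dominate the dataset.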
## Preregistration Statements

**Hypothesis 1:** \
Null Hypothesis: There is no statistically significant difference in depression rates between countries with a GDP lower than 500 billion and countries with a GDP higher than 500 billion.

Alternative Hypothesis: The percentage of the population with depression is significantly higher in countries where the GDP is lower. We would expect to see an inverse relationship between GDP and depression levels.

**Hypothesis 2:** \
Null Hypothesis: There is no statistically significant difference in anxiety disorders between countries with a greater decrease in unemployment rate.

Alternative Hypothesis: An increase in anxiety disorders is more common in countries with a decrease in unemployment rates, potentially reflecting the impact of economic stability on mental health.

- Note: We made a small error in the original hypothesis, saying an increase in anxiety disorders is more common in countries with a DECREASE in unemployment rate. In phase four, we fixed that error and changed our hypothesis to: an increase in anxiety disorders is more common in countries with an INCREASE in unemployment rates.

**Hypothesis 3:** \
Null Hypothesis: None of the socioeconomic factors are able to predict any of the mental health disorder rates.

Alternative Hypothesis: The combined influence of every significant socioeconomic factor (GDP, Life Expectancy, Urbanization Rate, HDI, Unemployment Rate) can help predict at least one of the mental health disorders.

- Note: Hypothesis 3 was not preregistered, but we believed it would be relevant to test as many independent and dependent variables as possible in creating our final model.
## Hypothesis 1

Null Hypothesis: There is no statistically significant difference in depression rates between countries with a GDP lower than 500 billion and countries with a GDP higher than 500 billion.

Alternative Hypothesis: The percentage of the population with depression is significantly higher in countries where the GDP is lower. We would expect to see an inverse relationship between GDP and depression levels.
Our first hypothesis investigated whether countries with an average GDP of less than 500 billion between 2000 and 2015 have higher rates of depression. In our preregistration statement, we planned on setting GDP as a dummy input where 1 represented countries with an average GDP of 500 billion or more. Instead, we analyzed GDP as a range of values. We first subsetted our final dataframe to create a new dataframe called final_gdp containing the columns Country, Year, GDP, and Depression.
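For reference, the preregistered dummy-variable encoding could have been set up as follows. This is a minimal sketch using a hypothetical mini-frame; the column names mirror the report, but the frame itself is invented for illustration.

```python
import pandas as pd

# Hypothetical mini-frame; column names mirror the report's final_gdp
df = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "GDP, at current prices - US Dollars": [6e11, 7e11, 1e11, 2e11],
    "Depression (%)": [3.1, 3.2, 4.0, 4.1],
})

# Dummy: 1 if the country's mean GDP is at least 500 billion USD, else 0
mean_gdp = df.groupby("Country")["GDP, at current prices - US Dollars"].transform("mean")
df["high_gdp"] = (mean_gdp >= 500_000_000_000).astype(int)
print(df["high_gdp"].tolist())  # -> [1, 1, 0, 0]
```

Using `transform("mean")` keeps the dummy aligned with the original rows, so it could be used directly as a regressor.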
```python
# Subset final_df to only the necessary columns (Country, Year, GDP, Depression)
columns_of_interest = ["Country", "Year", "GDP, at current prices - US Dollars", "Depression (%)"]
final_gdp = final_df[columns_of_interest]

# Create two new dataframes: one for countries with an average GDP of at least
# 500 billion and one for countries below 500 billion
final_gdp['GDP, at current prices - US Dollars'] = pd.to_numeric(
    final_gdp['GDP, at current prices - US Dollars'], errors='coerce')
checkNans(final_gdp)

# Higher than 500 billion
high_gdp = final_gdp.groupby('Country')['GDP, at current prices - US Dollars'].mean()
high_gdp = high_gdp[high_gdp >= 500_000_000_000].index
high_gdp = final_gdp[final_gdp['Country'].isin(high_gdp)]
checkNans(high_gdp)
print(high_gdp.head())

# Lower than 500 billion
low_gdp = final_gdp.groupby('Country')['GDP, at current prices - US Dollars'].mean()
low_gdp = low_gdp[low_gdp < 500_000_000_000].index
low_gdp = final_gdp[final_gdp['Country'].isin(low_gdp)]
checkNans(low_gdp)
print(low_gdp.head())
print(final_gdp.head())

# Run a difference-of-means t-test (Welch's) to test for a significant
# difference in depression rates between high-GDP and low-GDP countries
t_stat, p_value = ttest_ind(high_gdp['Depression (%)'], low_gdp['Depression (%)'], equal_var=False)

# Display the results
print(f'T-statistic: {t_stat:.5}')
print(f'P-value: {p_value:.5}')

# Interpret the results
if p_value < 0.05:
    print('The difference in means is statistically significant.')
else:
    print('There is no significant difference in means.')
```

The p-value is less than 0.05, which means there is a statistically significant difference between depression levels in countries with a high GDP and countries with a low GDP. To get further statistical information regarding the relationship between GDP and depression, we can run an OLS regression.
```python
# Check for heteroskedasticity between Depression and GDP
sns.residplot(data=final_gdp, x=final_gdp["GDP, at current prices - US Dollars"],
              y=final_gdp['Depression (%)'])
plt.show()
```

The residual plot shows that the Depression (%) residuals are not randomly distributed and cluster toward the lower end of the GDP scale. This means our data is heteroskedastic, and we will need to transform the GDP values.
```python
# Log transformation
final_gdp['Log_GDP'] = np.log(final_gdp['GDP, at current prices - US Dollars'])
final_df['Log_GDP'] = np.log(final_gdp['GDP, at current prices - US Dollars'])
print(final_gdp.head())

# Re-check for heteroskedasticity with the transformed values
sns.residplot(data=final_gdp, x=final_gdp["Log_GDP"], y=final_gdp['Depression (%)'])
plt.show()
```

The regraphed residual plot displays a random distribution around the x-axis between the log-transformed GDP values and depression rates.
```python
# Run an Ordinary Least Squares regression
# Separate predictor (X) and target variable (y)
X = final_gdp[['Log_GDP']]
y = final_gdp['Depression (%)']

# Add a constant term to the predictor matrix
X = sm.add_constant(X)

# Fit the linear regression model
model = sm.OLS(y, X).fit()

# Print the intercept, coefficients, and regression summary
print(model.summary())
```

The OLS regression results display a p-value that rounds to 0, which is less than 0.05, so we can reject the null hypothesis and conclude that GDP is associated with depression rates. Furthermore, the 95% confidence interval for the Log_GDP coefficient, 0.025 to 0.051, does not include 0, which indicates that GDP has an effect in the positive direction on depression rates, whereas our alternative hypothesis stated that depression would be higher in lower-GDP countries. To further confirm this, we can run a linear regression.
```python
# Run a linear regression
x = final_gdp[["Log_GDP"]]
y = final_gdp[["Depression (%)"]]
linmodel = LinearRegression().fit(x, y)
yhat = linmodel.predict(x)
print(f"The regression coefficient is: {linmodel.coef_[0]}")
print(f"The intercept is: {linmodel.intercept_}")

# Plot as a scatterplot
plt.scatter(x, y)

# Plot the regression line
plt.plot(x, yhat, color='red', linewidth=2, label='Regression line')

# Add labels and title
plt.xlabel('Log GDP')
plt.ylabel('Depression (%)')
plt.legend()

# Show the plot
plt.show()
```

We ran a linear regression of Depression (%) on the log-transformed GDP values to obtain the regression coefficient. The coefficient of 0.03797762 means that each one-unit increase in log GDP is associated with a 0.038 increase in Depression (%). The intercept tells us that when log GDP is 0, the expected Depression (%) is 2.574.
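Because the predictor is log-transformed, the slope can also be read as an approximate effect per 1% change in GDP: with y = a + b·ln(GDP), a 1% increase in GDP shifts y by b·ln(1.01). A quick check of that conversion, using the slope value reported in the permutation section:

```python
import numpy as np

slope = 0.03797762  # coefficient on log(GDP) from the fit above
per_one_percent = slope * np.log(1.01)  # effect of a 1% GDP increase on Depression (%)
print(round(per_one_percent, 6))  # -> 0.000378
```

So in percentage terms, the estimated effect of GDP on depression rates is small, even though it is statistically significant.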
## Permutation

To determine whether the observed slope of 0.03797762 actually represents a relationship between GDP and depression levels, we can permute the GDP and Depression columns. We randomly sampled 1000 data values from each column and ran a linear regression for each of 1000 random pairings of GDP and Depression values. We then stored the slope of each regression and compared it to our observed slope of 0.03797762.
```python
# Sample random pairs
X_1000 = final_gdp["Log_GDP"].sample(n=1000)
Y_1000 = final_gdp["Depression (%)"].sample(n=1000)
permutation_slopes = np.zeros(1000)
for i in range(1000):
    permuted_X = np.random.choice(X_1000, 1000, replace=False)
    # Use .values so the sampled Series does not realign on its original index
    df = pd.DataFrame({"X": permuted_X, "Y": Y_1000.values})
    model = LinearRegression().fit(df[["X"]], df["Y"])
    permutation_slopes[i] = model.coef_[0]

sns.histplot(permutation_slopes, bins=30)
plt.axvline(0.03797762, color='red', linestyle='dashed', linewidth=2)

greater_than_observed_slope = sum(slope > 0.03797762 for slope in permutation_slopes)
percentile_observed = (greater_than_observed_slope / len(permutation_slopes)) * 100
print(f"The percentage of permuted slopes greater than the observed slope 0.03797762 is: {percentile_observed}%")
```

After running a permutation test with 1000 sampled values of our GDP and depression data, we compared our observed regression slope to the permuted slope values. Both our histogram and percentile calculation show that our observed slope is in the 99th percentile, which means it is highly significant and that GDP has a relevant and significant impact on depression rates. However, our analysis demonstrates a relationship in the positive direction, which does not support our alternative hypothesis.
## Hypothesis 2

Null Hypothesis: There is no statistically significant difference in anxiety disorders between countries that experience an increase in unemployment rate and countries with a decrease in unemployment rate.

Alternative Hypothesis: An increase in anxiety disorders is more common in countries with an increase in unemployment rates, potentially reflecting the impact of economic stability on mental health.

- Note: We made a small error in the original hypothesis, saying an increase in anxiety disorders is more common in countries with a DECREASE in unemployment rate. In phase four, we fixed that error and changed our hypothesis to: an increase in anxiety disorders is more common in countries with an INCREASE in unemployment rates.
```python
# First step: calculate the slope of the unemployment rate for each country from
# 2000 up to 2015 and identify which countries have a positive slope, indicating
# an increasing unemployment rate

# Subset data to just Country, Year, and Unemployment Rate
df_unemployment = final_df[['Country', 'Year', 'Unemployment Rate']]
df_unemployment.head()

# Pivot the data to have years as columns
pivoted_unemploy = df_unemployment.pivot(index='Country', columns='Year', values='Unemployment Rate')

# Create a new DataFrame to store the slopes
slopes_df = pd.DataFrame(index=pivoted_unemploy.index, columns=['Slope'])

# Fit a linear regression model for each country and store the slope
for country in pivoted_unemploy.index:
    y = pivoted_unemploy.loc[country].values
    X = sm.add_constant(range(len(y)))  # Add a constant term for the intercept
    model = sm.OLS(y, X).fit()
    slopes_df.loc[country, 'Slope'] = model.params[1]  # The slope coefficient

# Filter countries with a positive slope, indicating an increasing trend
countries_with_increasing_rate = slopes_df[slopes_df['Slope'] > 0].index
print("Countries with an increasing unemployment rate trend:")
print(countries_with_increasing_rate.tolist())
```

```python
# Second step: subset the original dataframe into two separate dataframes, one
# with increasing unemployment rates and one with decreasing unemployment rates
unemploy_increasing = final_df[final_df['Country'].isin(countries_with_increasing_rate)]
unemploy_decreasing = final_df[~final_df['Country'].isin(countries_with_increasing_rate)]
print(unemploy_increasing.head())
```

```python
# Perform a two-sample t-test to compare the means of the two groups.
# The null hypothesis (H0) assumes no significant difference between the group
# means; the alternative hypothesis (HA) assumes there is one.
t_stat, p_value = ttest_ind(unemploy_increasing['Anxiety disorders (%)'],
                            unemploy_decreasing['Anxiety disorders (%)'], equal_var=False)
print(f'T-test statistic: {t_stat}')
print(f'P-value: {p_value}')

# Determine statistical significance
alpha = 0.05
if p_value < alpha:
    print('Reject the null hypothesis. There is a statistically significant difference in means.')
else:
    print('Fail to reject the null hypothesis. There is no statistically significant difference in means.')
```

### Analysis

Based on the t-test, the p-value is less than 0.05. That means we reject the null hypothesis: there is a significant difference in the mean percentage of anxiety disorders between countries with an increasing unemployment rate and countries with a decreasing unemployment rate. Now let us try to understand that difference using visualizations and descriptive statistics.
```python
# Descriptive statistics
stats_increasing = unemploy_increasing['Anxiety disorders (%)'].describe()
stats_decreasing = unemploy_decreasing['Anxiety disorders (%)'].describe()
print("Descriptive Statistics - Increasing Unemployment Rate:")
print(stats_increasing)
print("\nDescriptive Statistics - Decreasing Unemployment Rate:")
print(stats_decreasing)
```

```python
# Box plots
plt.figure(figsize=(10, 6))
plt.boxplot([unemploy_increasing['Anxiety disorders (%)'], unemploy_decreasing['Anxiety disorders (%)']],
            labels=['Increasing', 'Decreasing'])
plt.title('Box Plot of Anxiety Disorder Rates')
plt.ylabel('Percentage of Anxiety Disorders')
plt.xlabel('Unemployment Rate Trend')
plt.show()
```

### Analysis
From the descriptive statistics and the box plot, we can see that the mean percentage of anxiety disorders for countries with an increasing unemployment rate (4.208201) is higher than for countries with a decreasing unemployment rate (3.830980).
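Beyond comparing the two means, an effect-size measure such as Cohen's d can quantify how large the gap between the groups actually is. The sketch below is illustrative only: the two samples are synthetic, generated around the group means reported above, since the real `unemploy_increasing` and `unemploy_decreasing` frames are built earlier in the notebook.

```python
import numpy as np

def cohens_d(a, b):
    """Pooled-standard-deviation effect size for two independent samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) \
        / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Synthetic stand-ins centred on the reported group means (4.208 vs 3.831)
rng = np.random.default_rng(0)
increasing = rng.normal(4.208, 1.0, 500)
decreasing = rng.normal(3.831, 1.0, 500)
print(round(cohens_d(increasing, decreasing), 3))
```

A d near 0.2 is conventionally read as a small effect and near 0.5 as a medium one, so a significant t-test can still correspond to a modest practical difference.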
## 2. Linear Regression
We can also run a regression to see the relationship between an increasing unemployment rate and anxiety disorders. But first, let us check for heteroskedasticity by making a residual plot of unemployment rate against anxiety disorder rates.
```python
sns.residplot(data=unemploy_increasing, x="Unemployment Rate", y='Anxiety disorders (%)')
plt.show()
```

The residual plot shows a relatively random distribution, suggesting a linear relationship between unemployment rate and the percentage of anxiety disorders. There is no need for a transformation, so we can go ahead with our linear regression for countries with an increasing unemployment rate.
```python
# Define dependent and independent variables
y = unemploy_increasing['Anxiety disorders (%)']
X = unemploy_increasing[['Unemployment Rate']]

# Add a constant term to the predictor matrix
X = sm.add_constant(X)

# Fit the linear regression model
model = sm.OLS(y, X).fit()
print(model.summary())
```

### Analysis
We can interpret this as: for every 1 unit increase in the unemployment rate, anxiety disorders (%) increases by 0.0085. However, the p-value is 0.300, which is more than 0.05 and indicates that the relationship between unemployment rate and anxiety disorders is not highly significant.
## Permutation
To determine whether the 0.0085 actually represents a relationship between an increasing unemployment rate and anxiety disorder rates, we can permute those columns.
```python
# Sample 900 random rows together so the X and Y values stay paired
sample_900 = unemploy_increasing[["Unemployment Rate", "Anxiety disorders (%)"]].sample(n=900)
X_900 = sample_900["Unemployment Rate"]
Y_900 = sample_900["Anxiety disorders (%)"]

permutation_slopes = np.zeros(1000)
for i in range(1000):
    # Shuffle X to break any real association with Y
    permuted_X = np.random.choice(X_900, 900, replace=False)
    df = pd.DataFrame({"X": permuted_X, "Y": Y_900.values})
    model = LinearRegression().fit(df[["X"]], df["Y"])
    permutation_slopes[i] = model.coef_[0]

sns.histplot(permutation_slopes, bins=30)
plt.axvline(0.0085, color='red', linestyle='dashed', linewidth=2)
plt.show()

greater_than_observed_slope = np.sum(permutation_slopes > 0.0085)
percentile_observed = (greater_than_observed_slope / len(permutation_slopes)) * 100
print(f"The percentage of values greater than the observed slope (0.0085) is: {percentile_observed}%")
```

Our objective was to see the red dashed line towards the right tail of the distribution. However, having around 14.6% of values greater than the observed slope suggests that the observed relationship may not be significantly stronger or more positive than what random chance would produce. In other words, although we see a positive relationship between an increasing unemployment rate and anxiety disorders, based on our data this might have occurred due to random chance rather than a genuine positive relationship.
## Hypothesis 3

Null Hypothesis: None of the socioeconomic factors are able to predict any of the mental health disorder rates.

Alternative Hypothesis: The combined influence of every significant socioeconomic factor (GDP, Life Expectancy, Urbanization Rate, HDI, Unemployment Rate) can help predict at least one of the mental health disorders.

- Note: Hypothesis 3 was not preregistered, but we believed it would be relevant to test as many independent and dependent variables as possible in creating our final model.
We have our independent variables ('Life expectancy ', 'GDP, at current prices - US Dollars', 'HDI', 'Urbanization Rate', 'Unemployment Rate') and our dependent variables ('Schizophrenia (%)', 'Bipolar disorder (%)', 'Eating disorders (%)', 'Anxiety disorders (%)', 'Drug use disorders (%)', 'Depression (%)', 'Alcohol use disorders (%)'). Since we have multiple predictor variables, there is a high risk of multicollinearity, where the independent variables could be highly correlated with each other. So before we run our OLS regression, we will run a multicollinearity test.
```python
# Check for multicollinearity between independent variables by running a correlation matrix
independent_vars = ['Life expectancy ', 'GDP, at current prices - US Dollars', 'HDI',
                    'Urbanization Rate', 'Unemployment Rate']
dependent_vars = ['Schizophrenia (%)', 'Bipolar disorder (%)',
                  'Eating disorders (%)', 'Anxiety disorders (%)',
                  'Drug use disorders (%)', 'Depression (%)', 'Alcohol use disorders (%)']

correlation_matrix = final_df[independent_vars].corr()
correlation_matrix
```

### Interpretation
- Life expectancy has a moderate to strong positive correlation with HDI (0.898) and Urbanization Rate (0.664), and a weaker positive correlation with GDP (0.332).
- GDP has a weak to moderate positive correlation with HDI (0.343) and Urbanization Rate (0.290), and a weak negative correlation with Unemployment Rate (-0.042).
- HDI has a strong positive correlation with Life expectancy (0.899) and Urbanization Rate (0.765), and a moderate positive correlation with GDP (0.343).
- Urbanization Rate has moderate to strong positive correlations with Life expectancy (0.664) and HDI (0.765), and weak positive correlations with GDP (0.290) and Unemployment Rate (0.133).
- Unemployment Rate shows very weak correlations with the other variables: a weak negative correlation with GDP (-0.042) and a weak positive correlation with Urbanization Rate (0.133).

High correlations among predictors might indicate multicollinearity, which can affect the stability and interpretation of the regression coefficients. We will then check the variance inflation factor (VIF) for each predictor to quantify the severity of multicollinearity. If multicollinearity is detected (VIF values greater than 5 or 10), we plan to deal with it by removing highly correlated predictors.
```python
# Select only the predictor variables (independent variables)
predictors = final_df[['Life expectancy ', 'GDP, at current prices - US Dollars', 'HDI',
                       'Urbanization Rate', 'Unemployment Rate']]

# Calculate the VIF for each variable
vif_data = pd.DataFrame()
vif_data["Variable"] = predictors.columns
vif_data["VIF"] = [variance_inflation_factor(predictors.values, i) for i in range(predictors.shape[1])]
print(vif_data)
```

### Interpretation
- Life expectancy: A VIF of 63.56 indicates extremely high multicollinearity, suggesting that 'Life expectancy' is highly correlated with one or more of the other predictor variables in our model.
- GDP, at current prices - US Dollars: A VIF of 1.27 suggests low multicollinearity. This variable has minimal correlation with the other predictors in the model.
- HDI (Human Development Index): With a VIF of 5.50, this variable exhibits moderate multicollinearity. While not as high as 'Life expectancy', it indicates a notable correlation with other predictors.
- Urbanization Rate: The VIF of 2.46 also indicates moderate multicollinearity, signifying a moderate correlation with other variables.
- Unemployment Rate: The VIF of 0.99 suggests low multicollinearity, indicating minimal correlation with the other predictors in the model.
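Applying the VIF cutoff mechanically makes the selection reproducible. The sketch below hard-codes the VIF values reported in the interpretation above (so it runs stand-alone) and drops any predictor whose VIF exceeds 5:

```python
import pandas as pd

# VIF values as reported above; in the notebook this frame is computed as `vif_data`
vif_data = pd.DataFrame({
    "Variable": ["Life expectancy ", "GDP, at current prices - US Dollars",
                 "HDI", "Urbanization Rate", "Unemployment Rate"],
    "VIF": [63.56, 1.27, 5.50, 2.46, 0.99],
})

# Conservative cutoff of 5; a looser rule of thumb uses 10
to_drop = vif_data.loc[vif_data["VIF"] > 5, "Variable"].tolist()
print(to_drop)  # → ['Life expectancy ', 'HDI']
```

Note that with the looser cutoff of 10, only Life expectancy would be dropped (HDI's VIF of 5.50 sits between the two thresholds), so the threshold choice matters.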
Since Life expectancy and HDI have the highest multicollinearity, we will remove them when we run our linear regression. Our next step is to make residual plots to check for any kind of heteroskedasticity amongst our variables.
```python
# Residual plots
independent_vars = ['Log_GDP', 'Urbanization Rate', 'Unemployment Rate']
dependent_vars = ['Schizophrenia (%)', 'Bipolar disorder (%)',
                  'Eating disorders (%)', 'Anxiety disorders (%)',
                  'Drug use disorders (%)', 'Depression (%)', 'Alcohol use disorders (%)']

for dependent_var in dependent_vars:
    for independent_var in independent_vars:
        sns.residplot(data=final_df, x=independent_var, y=dependent_var)
        plt.xlabel(independent_var)
        plt.ylabel(f"Residuals for {dependent_var}")
        plt.title(f"Residual Plot for {dependent_var} against {independent_var}")
        plt.show()
```

Our residual plots display a random distribution for every independent variable plotted against every dependent variable. Log_GDP was transformed when we tested Hypothesis 1, and now displays a random distribution. Now that we have checked for heteroskedasticity, we can run our linear regression on our independent and dependent variables.
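For reference, `Log_GDP` is the log-transformed GDP column created during Hypothesis 1 testing. A minimal sketch of that transformation, assuming a natural log was used (a hypothetical three-row frame stands in for `final_df`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for final_df with a GDP column
df = pd.DataFrame({"GDP, at current prices - US Dollars": [1200.0, 54000.0, 310000.0]})

# The log transform compresses GDP's heavy right tail, which is what
# removed the heteroskedasticity observed during Hypothesis 1 testing
df["Log_GDP"] = np.log(df["GDP, at current prices - US Dollars"])
print(df["Log_GDP"].round(2).tolist())  # → [7.09, 10.9, 12.64]
```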
```python
independent_vars = ['Log_GDP', 'Urbanization Rate', 'Unemployment Rate']
dependent_vars = ['Schizophrenia (%)', 'Bipolar disorder (%)',
                  'Eating disorders (%)', 'Anxiety disorders (%)',
                  'Drug use disorders (%)', 'Depression (%)', 'Alcohol use disorders (%)']

for dependent_var in dependent_vars:
    X = final_df[independent_vars]
    y = final_df[dependent_var]
    X = sm.add_constant(X)       # Add a constant term
    model = sm.OLS(y, X).fit()   # Fit the OLS model
    print(f"Regression results for {dependent_var}:")
    print(model.summary())

    # Interpret the coefficients from the model
    coef_interpretation = dict(zip(X.columns, model.params))
    print("Interpretation:")
    for var in independent_vars:
        print(f"For one unit increase in {var}, there is an expected change of "
              f"{coef_interpretation[var]:.4f} in {dependent_var}.")
    print("\n")
```

### Interpretation
We can see that all p-values are less than 0.05, showing that the combined socioeconomic factors have a significant relationship with mental health disorder rates. Schizophrenia, eating disorders, drug use disorders, and anxiety disorders are the disorders that have a significant relationship with all three socioeconomic factors (GDP, urbanization rate, unemployment rate).
```python
independent_vars = ['Log_GDP', 'Urbanization Rate', 'Unemployment Rate']
dependent_vars = ['Schizophrenia (%)', 'Bipolar disorder (%)',
                  'Eating disorders (%)', 'Anxiety disorders (%)',
                  'Drug use disorders (%)', 'Depression (%)', 'Alcohol use disorders (%)']

X = final_df[independent_vars]

for dependent in dependent_vars:
    y = final_df[dependent]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Cross validation
    kf = KFold(n_splits=3, shuffle=False)

    # Calculate RMSE and MAE (sklearn returns negated scores, so flip the sign)
    train_scores_rmse = cross_val_score(LinearRegression(), X_train, y_train, cv=kf,
                                        scoring='neg_root_mean_squared_error')
    test_scores_rmse = cross_val_score(LinearRegression(), X_test, y_test, cv=kf,
                                       scoring='neg_root_mean_squared_error')
    train_scores_mae = cross_val_score(LinearRegression(), X_train, y_train, cv=kf,
                                       scoring='neg_mean_absolute_error')
    test_scores_mae = cross_val_score(LinearRegression(), X_test, y_test, cv=kf,
                                      scoring='neg_mean_absolute_error')

    print(f"Mean Train {dependent} RMSE: ", round(np.mean(-train_scores_rmse), 5))
    print(f"Mean Test {dependent} RMSE:", round(np.mean(-test_scores_rmse), 5))
    print(f"Mean Train {dependent} MAE: ", round(np.mean(-train_scores_mae), 5))
    print(f"Mean Test {dependent} MAE:", round(np.mean(-test_scores_mae), 5))
```

Both RMSE and MAE are error metrics. All of our RMSE and MAE values display low error, which means that our model combining all significant independent socioeconomic factors (GDP, Urbanization, Unemployment) performs well at predicting each of the mental health disorders. Furthermore, Schizophrenia has the MAE value closest to 0, which means our model best predicts Schizophrenia rates. The disorder with the highest error is Anxiety disorder, which means our model is least useful for predicting anxiety disorders. However, the test RMSE and MAE values are slightly lower than the train RMSE and MAE values, which indicates that our model might be prone to underfitting. The error values for every variable and the differences between the training and testing sets are small enough that we believe our model combining GDP, urbanization rate, and unemployment rate can predict Schizophrenia, eating disorders, drug use disorders, and alcohol use disorders.
## Interpretation and Conclusion

To address our first hypothesis, our two-sample t-test analysis revealed a statistically significant correlation between GDP and depression, allowing us to reject our null hypothesis. However, we still failed to support our alternative hypothesis because our OLS and linear regression models, coupled with our permutation test, revealed a highly significant relationship between GDP and depression in the positive direction. Thus, we rejected our null hypothesis but failed to support our alternative hypothesis.

Our analysis of our second hypothesis incorporated a two-sample t-test, box plot visualization, OLS regression, and permutation testing to determine whether an increase in anxiety disorders was significantly related to an increase in unemployment rates. Our t-test allowed us to reject the null hypothesis and conclude that there is a statistically significant relationship between anxiety and unemployment rate. However, our OLS regression coefficient of 0.0085 had a p-value of 0.300, which indicates that even though we observed a positive relationship, the relationship isn't highly significant. We further confirmed this through permutation testing, which found that 16.8% of randomly generated linear regression slopes were higher than 0.0085, indicating that our slope could have been observed due to random chance. One reason for this could be that many factors outside of the unemployment rate have a greater influence on rates of anxiety disorders. Our analysis demonstrates that unemployment rate may not best predict anxiety disorders.

After testing our first two hypotheses, we aimed to create a model that would incorporate multiple socioeconomic factors to predict as many mental health disorder rates as possible. We did this by applying multicollinearity tests and VIF tests to determine that GDP, urbanization rate, and unemployment rate are the least correlated with each other and should be included in our final model. Our OLS regression model demonstrated that of all the mental health disorders, Schizophrenia, eating disorders, drug use disorders, and alcohol use disorders were significantly related to all three socioeconomic variables. We checked this using cross-validation scores, where we split our dataset into 70% training and 30% testing and applied 3-fold cross validation to generate RMSE and MAE values for every train and test output variable. We observed that all of our disorders had low error values as well as small differences between the training and testing sets. Thus, we conclude that our model combining GDP, urbanization rate, and unemployment rate can accurately and best predict Schizophrenia, eating disorders, drug use disorders, and alcohol use disorders.